ABSTRACT

Earlier studies have revealed a substantial amount
of transcriptional activity occurring outside anno- 
tated protein-coding genes of the Caenorhabditis 
elegans genome. One important fraction of this 
transcriptional activity relates to intermediate- 
size (70-500 nt) transcripts (is-ncRNAs) of mostly 
unknown function. Profiling the expression of this 
segment of the transcriptome on a tiling array 
through the C. elegans life cycle identified 5866 
hitherto unannotated transcripts. The novel loci 
were distributed across intronic and intergenic 
space, with some enrichment toward protein- 
coding gene termini. The majority of the putative 
is-ncRNAs showed either stage-specific expression, 
or distinct developmental variation in their ex- 
pression levels. More than 200 loci showed 
male-specific expression, and conserved loci were 
significantly enriched on the X chromosome, both 
observations strongly suggesting involvement of 
is-ncRNAs in sex-specific functions. Half of the 
novel loci were conserved in other nematodes, and 
numerous loci showed significant conservational 
correlations to nearby coding genes. Assuming 
functional roles for most of the novel loci, the data 
imply a nematode is-ncRNA tool kit of considerable 
size and variety. 



INTRODUCTION 

Recent years have seen increasing efforts toward the un- 
raveling of the functional roles of non-protein coding 
RNAs (ncRNAs) in organismal development. Non- 
coding RNAs have broadly been divided into small 
(<200nt) and long (>200nt) transcripts (1), and 
research has been particularly intense on microRNAs 
and other RNAs ranging between 15 and 40 nt in size. 
Simultaneously, increasing efforts are being made to
investigate the roles of many long and mRNA-like ncRNAs
found in mammalian transcriptomes (2). However,
eukaryote transcriptomes are also composed of several
classes of transcripts whose size range spans the border 
between small and long RNA. For practical purposes we 
will, in the following text, refer to transcripts in this size 
range (70-500 nt) as 'intermediate-size ncRNAs' 
(is-ncRNAs). Such ncRNAs include the well-studied
snRNAs and snoRNAs, but it has been known since early
this century that this size range also comprises
numerous other transcripts with less well-defined
roles (3,4). Large-scale transcriptome analyses by tiling
array or deep sequencing have recently demonstrated 
the existence of considerable numbers of transcripts in 
this size range in all investigated organisms (5-7). Very 
many of the intermediate-size transcripts in eukaryotes 
appear to occur in the context of protein coding loci and 
are being referred to under various denominations, such 
as PASRs (promoter-associated short RNAs), TASRs 
(terminator-associated short RNAs), CUTs (cryptic 



*To whom correspondence should be addressed. Tel: +86 64888543, Fax: +86 10 64889892; Email: crs@sun5.ibp.ac.cn
Correspondence may also be addressed to Geir Skogerbo. Tel: +86 64888543, Fax: +86 10 64889892; Email: zgb@moon.ibp.ac.cn 

The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors. 

© The Author(s) 2011. Published by Oxford University Press. 

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5),
which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.



5204 Nucleic Acids Research, 2011, Vol. 39, No. 12 



unstable transcripts), SUTs (stable unannotated tran- 
scripts), PROMTS (transcripts upstream of core pro- 
moters) and eRNAs (enhancer RNA) (8-12), depending 
on the organism of origin, size range and specific genomic 
location. However, even after very stringent filtering of 
tiling array and RNA-Seq data, there remained several
thousand putative is-ncRNAs in mammalian transcriptomes
that could not be accounted for in this way (13).

With the exception of snRNAs, snoRNAs and a few 
others, whose cellular roles were largely established in 
the final decades of the 20th century, the functional 
properties of is-ncRNAs are just beginning to be 
touched upon. Transcripts arising around and in concert 
with transcription of protein coding loci may be involved 
in transcription activation of the coding loci, or simply be 
'by-products' of such activation (14-16). It is probably
premature to explain all coding locus-associated transcrip- 
tion in this way, and is-ncRNA genes located in deep 
intergenic space can hardly be thus accounted for. On 
the contrary, there is compelling evidence that the large 
numbers of identified but yet unstudied non-coding tran- 
scripts have intrinsic functionality, as indicated by the 
conservation of their promoters, structures, genomic 
position and expression patterns (17-20). Investigations 
into a number of is-ncRNAs in C. elegans suggested that 
on the one hand, they are fairly recalcitrant to 
knock-down by RNAi, and on the other hand, their 
cellular stability largely depends on interactions with 
proteins or protein complexes (21). This would suggest 
that transcripts in this size range exert their functions in 
stable ribonucleoproteins which, in addition to being 
vehicles of their cellular function, also confer resistance 
to cellular ribonucleases. In the former respect, 
is-ncRNAs may resemble miRNAs in that they link the 
digital information of the nucleotide to the analogue in- 
formation of protein structure (22). Given their sheer 
numbers (apparently in the thousands) and the relatively 
low research effort invested in elucidating their functional 
roles, is-ncRNAs have the potential to fill a regulatory 
space of a magnitude similar to that occupied by 
microRNAs. 

The current annotation (WS190) of the C. elegans
genome estimates ~20 000 protein coding genes and ~900
intermediate-size (70-500 bp) ncRNA genes (23,24), and
computational predictions have suggested the presence of
an additional 3000-4000 is-ncRNAs in the genome (4,25).
We have previously carried out a tiling array analysis
which identified approximately 1200 novel
intermediate-size transcripts in a mixed stage culture of C. elegans (5).
Much of the mammalian tiling array data have not stood
up well to scrutiny in light of deep sequencing data (13);
however, careful analyses of both methodologies in 
C. elegans demonstrate that the tiling array compares 
well with deep sequencing when necessary measures are 
in place (26). We, therefore, applied tiling array analysis 
to six developmental and two conditional stages of the 
nematode, detecting 5866 novel intermediate-size tran- 
scripts [or transcribed fragments (transfrags) of 
unknown function; TUFs]. Fifty-two (85%) of 59 tested 
TUFs were verifiable by reverse transcription-polymerase 
chain reaction (RT-PCR) and an additional 10 of 10 



TUFs were verifiable by Northern blot and RACE experi- 
ments. These TUFs exhibited more complex expression 
patterns across stages, and most showed features different
from those of known is-ncRNA types and coding genes,
suggesting the existence of novel functional types of 
intermediate-size RNAs. 

MATERIALS AND METHODS 

Preparation of RNA samples and tiling array 

RNA samples were prepared from wild-type N2 strain 
worms at larval stages 1-4 (L1-L4), mature adult (MA) 
and male (ML) worms, dauer stage worms (DU), and 
worms subjected to heat-shock (HS). Total RNA was ex- 
tracted from each of the eight different developmental 
stages and environmental conditions according to the 
Trizol (Invitrogen) protocol. Intermediate-size RNAs 
(70-500nt) were isolated using a QIAGEN tip (Qiagen), 
and remaining rRNAs were removed by adapting the 
MicrobExpress kits (Ambion). The enriched is-RNAs 
were dephosphorylated with CIAP (Fermentas) and then 
ligated to the 3'-adaptor oligonucleotide by T4 RNA
ligase (Fermentas). Each RNA sample was reverse 
transcribed using random hexamers and a primer comple- 
mentary to the 3'-adaptor. Double-strand cDNA was 
fractioned, labeled and hybridized to the Affymetrix
GeneChip® C. elegans Tiling 1.0R Array according to
Affymetrix's GeneChip Whole Transcript (WT)
Double-Stranded Target Assay Manual. RNA sample preparation
and hybridization were carried out twice for each of the
C. elegans stages or conditions, except for MLs and MAs,
which were sampled only once.

Computational analysis 

The genome annotation, sequence and conservation data
were downloaded from Wormbase (http://www.wormbase.org,
version WS190) (23) and the UCSC
genome browser (http://genome.ucsc.edu/, version ce6)
(27). The raw tiling array data were pre-processed using
the Affymetrix Tiling Analysis Software (TAS, version
1.1.02). Briefly, quantile-normalization was performed 
on the tiling array replicates and signal intensity values 
were then adjusted to yield a median intensity of 100. 
Log2[max(PM - MM, 1)] was calculated for each probe
as an estimate of the expression level at each genomic 
position. Probe signal intensities were considered as sig- 
nificant over background if above the threshold associated 
with a false-positive rate of 0.05. Transcribed fragments 
(transfrags) were identified using a sliding window method
with window size = 100, maxgap = 30 and minrun = 70. 
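
The probe-level signal and the run-joining step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the TAS implementation: the function names are hypothetical, each probe is reduced to a single genomic coordinate, and the smoothing over the 100 bp window is left out. Only the formula and the maxgap = 30 and minrun = 70 settings are taken from the text.

```python
import math

def probe_signal(pm, mm):
    """Log2 of the mismatch-corrected intensity, floored at 1 (so never below 0)."""
    return math.log2(max(pm - mm, 1))

def call_transfrags(positions, positive, maxgap=30, minrun=70):
    """Join positive probes lying at most `maxgap` bp apart and keep runs
    that span at least `minrun` bp. Returns a list of (start, end) tuples.
    `positions` must be sorted; `positive` flags probes above background."""
    runs = []
    start = prev = None
    for pos, is_positive in zip(positions, positive):
        if not is_positive:
            continue  # negative probes only matter through the gap they create
        if start is None:
            start = prev = pos
        elif pos - prev <= maxgap:
            prev = pos  # extend the current run
        else:
            if prev - start >= minrun:
                runs.append((start, prev))
            start = prev = pos  # open a new run
    if start is not None and prev - start >= minrun:
        runs.append((start, prev))
    return runs

# Toy data (hypothetical): probes every 25 bp, five positives, a gap, two positives.
toy_positions = [0, 25, 50, 75, 100, 125, 150, 175, 200, 225]
toy_positive = [True, True, True, True, True, False, False, False, True, True]
toy_runs = call_transfrags(toy_positions, toy_positive)
```

In the toy data the first five probes form one run of 100 bp, while the last two probes span only 25 bp and are discarded by the minrun criterion.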

For normalization within arrays, the signal intensities of
all stages and conditions were quantile-normalized (R,
limma package). The transfrags were filtered with the 
normalized signal by removing the ones with low signal 
intensity [threshold = 6, false detection rate (FDR) = 
0.05]. The remaining transfrags were annotated by 
mapping to known is-ncRNAs (Wormbase and other
published is-ncRNAs) (4,5,28), or to other annotations from
UCSC (SangerGene annotations, pseudogene annotation 
and repeat annotations) (27), introns or unannotated 
intergenic regions. The unannotated transfrags dataset
was refined by removing TUFs covering probes corres- 
ponding to multiple genomic regions or with homology 
to ESTs (identity >95% and alignment >35bp). 
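
Quantile normalization across stages followed by the intensity cutoff can be illustrated with a short sketch. It is an assumption-laden illustration, not the limma routine used in the study: the function names are hypothetical, ties are broken by input order, and the text does not say whether the cutoff at 6 is inclusive, so a strict comparison is assumed here.

```python
def quantile_normalize(columns):
    """Give every column (one per array) the same empirical distribution:
    the k-th smallest value of each column is replaced by the mean of the
    k-th smallest values across all columns. Columns must be equally long."""
    n = len(columns[0])
    # For each column, the indices that would sort it.
    orders = [sorted(range(n), key=col.__getitem__) for col in columns]
    # Mean of the rank-k values across columns.
    rank_means = [
        sum(col[order[k]] for col, order in zip(columns, orders)) / len(columns)
        for k in range(n)
    ]
    normalized = [[0.0] * n for _ in columns]
    for j, order in enumerate(orders):
        for rank, idx in enumerate(order):
            normalized[j][idx] = rank_means[rank]
    return normalized

def filter_transfrags(max_l2si, threshold=6.0):
    """Keep transfrags whose maximum L2SI exceeds the threshold (strict, by assumption)."""
    return {name: value for name, value in max_l2si.items() if value > threshold}

# Toy data (hypothetical): two arrays with three probes each.
toy_normalized = quantile_normalize([[5, 2, 3], [4, 1, 6]])
toy_kept = filter_transfrags({"tf1": 5.2, "tf2": 6.4, "tf3": 10.8})
```

After normalization both toy columns contain the same three values, arranged according to each probe's original rank within its array.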

Chromosome location, GC content, secondary struc- 
ture (29) and development profile analyses were done 
for all TUFs. Conservation analysis was implemented 
using phastCons data (UCSC, goldenPath/ce6/ 
phastCons6way Scores). BLAST and Infernal (30) were 
performed against several non-coding RNA databases 
(28,31-34) for sequence and functional homology. 
Prediction of snoRNAs was implemented using snoscan
(35), snoReport (36) and snoGPS (37) on both strands of the
TUFs. Motif analysis was done using the MEME/MAST
program (38,39). 

The TUF signal intensity profile data were combined
with coding gene-expression profile data obtained from
the Genome B.C. C. elegans Gene Expression
Consortium (http://elegans.bcgsc.bc.ca), and both data
sets were quantile normalized.

Validation of TUFs by RT-PCR 

Total RNA was digested with DNase I (Fermentas),
dephosphorylated and ligated to the 3AD (see
Supplementary Data) oligo. Reverse transcription was 
performed by using a 3RT primer (see Supplementary 
Data). First-strand cDNA was used as template for 
PCR with a pair of TUF sequence-specific primers. 
Total RNA digested with DNase I was used as a negative
control, while genomic DNA was used as a positive control.
Total RNA without DNase I digestion was used as a
control for genomic DNA contamination.

Northern blot and RACE 

The RNA probes used for Northern blots were labeled 
with DIG-UTP (Roche) by in vitro transcription, 
hybridized to 4 µg total RNA at 62°C overnight, and
detected with CDP-Star (Roche). For 3' RACE, total
RNA was ligated to the 3AD oligo, reverse transcribed
with the 3RT primer and PCR amplified with GSP 
primers. 5' RACE was performed using the SMART 
cDNA synthesis kit (ClonTech). Oligos used in these ex- 
periments are listed in Supplementary Document S4. 

RESULTS 

We have previously reported a survey of the genomic tran- 
scriptional activity in C. elegans which identified approxi- 
mately 1200 novel intermediate-size (70-500 nt) transcripts 
in a mixed stage worm population (5). In order to obtain a 
more detailed map of the intermediate-size RNAs through 
the C. elegans life cycle, we applied the same tiling micro- 
array approach (5) to worms in eight different develop- 
mental stages and environmental conditions. These 
included the four larval stages L1-L4, the MA stage, 
ML, worms in the DU stage, and worms exposed to HS. 
A transcribed fragment [transfrag (40)] was defined as at 
least four consecutive positive probes each separated by a 
gap of <30 bp. The data were normalized and we applied 
log2 (signal intensity) (L2SI) of six as the lower threshold 



cutoff for transfrag selection (see 'Methods' section and 
Supplementary Data for details). This rendered 32 230 
transfrags, covering 3.58 million base pairs of the
C. elegans genome, which were retained for further 
analysis. About 56.1% of the expressed base pairs had 
been annotated as either coding sequences or untranslated 
regions (UTRs) of coding transcripts, and 3.1% were 
annotated as either is-ncRNAs or pseudogenes (Figure
1A). The remaining 40.8% of the transcribed nucleotides
were located either in introns (17.4%) or in intergenic regions
(23.4%). The transfrags were distributed almost evenly
on the six C. elegans chromosomes (Supplementary
Figure S1).

The tiling arrays detected 73.3% of all known
is-ncRNAs (Table 1). This is a lower fraction than
reported in a previous tiling microarray assay of mixed
stage worms (5), and owes mainly to a more stringent
data-filtering procedure (see 'Methods' section and
Supplementary Data for details). This effect was most
prominent for the relatively short tRNAs, but was also
seen for other classes of known is-ncRNAs.
The protocol applied for sample preparation was not
designed to detect mature miRNAs (or other similarly
sized small RNAs like 21U-RNAs), but the tiling
array nonetheless produced positive signals for 29% of
the 138 known miRNA loci and a small number of
21U-RNA loci. A comparison to a previously published tiling
array analysis of mixed stage intermediate-size RNAs
showed that our TUFs overlapped 36-53% of the
mixed stage TUFs (see Supplementary Document S3).
The overlapping TUFs tended to be ubiquitously expressed
through all developmental stages and had correlated
(r^2 = 0.47) expression levels in the two datasets
(Figure 1B).
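
The detection rates quoted above follow directly from the counts in Table 1, as the short check below shows. The counts are copied from the table; the dictionary layout and function name are only illustrative.

```python
# (interrogated, detected) counts copied from Table 1.
TABLE1 = {
    "tRNA": (631, 440),
    "rRNA": (21, 17),
    "snoRNA": (133, 119),
    "snRNA": (90, 76),
    "SL2 RNA": (8, 7),
    "scRNA": (1, 1),
    "smY RNA": (1, 1),
    "Uncharacterized is-ncRNAs": (60, 32),
}

def detection_rate(interrogated, detected):
    """Percentage of interrogated loci that were detected, to two decimals."""
    return round(100.0 * detected / interrogated, 2)

total_interrogated = sum(i for i, _ in TABLE1.values())
total_detected = sum(d for _, d in TABLE1.values())
overall_rate = detection_rate(total_interrogated, total_detected)

# miRNAs and 21U-RNAs are listed separately from the is-ncRNA total.
mirna_rate = detection_rate(138, 40)
u21_rate = detection_rate(5356, 30)
```

The class counts sum to the 945 interrogated and 693 detected loci in the 'All interrogated RNAs' row, which gives the 73.3% figure, and 40 of 138 miRNA loci gives the 29% figure.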

Only 364 transfrags were highly expressed (L2SI > 10)
at any stage or condition, and the majority of the
transfrags (~21 480) were expressed at
relatively low levels (6 < L2SI < 7; Figure 1C). This is
well below the expression level of well-established
is-ncRNA classes, such as snRNAs and snoRNAs, which
generally had 4-6 times higher expression (Figure 1D). Other
previously verified is-ncRNAs displayed a wider distribution
of expression levels, and a considerable fraction
of these fell in the same expression range as the majority
of transfrags (Figure 1D). The number of transfrags expressed
above the threshold (L2SI > 6) at any given stage
or condition varied significantly with nematode development,
most transfrags being expressed in the first three
larval stages and fewer toward maturity (Figure 1E).
The lowest number of transfrags was observed in ML
worms.

The 32 230 transfrags were separated into different 
categories according to their genomic locations. We
compared annotation data from Wormbase (WS190) to the
locations of our transfrags (Supplementary Figure S2).
Approximately 19 000 transfrags wholly or partially
overlapped exonic sequences. These could represent
independent transcriptional units overlapping coding genes in
either orientation, as recently observed in other species. 
However, as the percentage of exonic transfrags declined 
markedly relative to other types of transfrags with higher 
Figure 1. Transcribed fragments. (A) Overall nucleotide distribution of the 32 230 tiling array transfrags. (B) TUF expression. Comparison to data
from He et al. (5). (C) Transfrag maximum log2 (signal intensity) (L2SI) distributions. (D) Known is-ncRNA maximum L2SI distribution ('Other
is-ncRNAs' include all known is-ncRNAs except snoRNAs, snRNAs and tRNAs). (E) Number of transfrags expressed in each developmental
and conditional stage. (F) Genomic distribution of the 32 230 transfrags mapping to exons (E), introns (I) and intergenic regions upstream (U),
downstream (D) and distant (>2kb, O) from the nearest coding gene.



Table 1. Detection rates for known is-ncRNA loci

ncRNA class                  Interrogated   Detected   Detection rate (%)
tRNA                                  631        440                69.73
rRNA                                   21         17                80.95
snoRNA                                133        119                89.47
snRNA                                  90         76                84.44
SL2 RNA                                 8          7                87.50
scRNA                                   1          1               100.00
smY RNA                                 1          1               100.00
Uncharacterized is-ncRNAs              60         32                53.33
All interrogated RNAs                 945        693                73.33
21U-RNA                              5356         30                 0.56
miRNA                                 138         40                28.99


signal intensity was increased (Figure 1F), it seems reasonable
to assume that at least a fraction of the exonic
transfrags represent degradation products of pre- 
mRNAs and mature mRNAs, or elements that have 
been spliced from pre-mRNAs. 

In subsequent analyses, we focused on transfrags not 
overlapping with other annotated genomic elements, i.e. 
intergenic and intronic transfrags. We therefore removed 
transfrags that had probes corresponding to multiple 
genome loci and transfrags that overlapped with repeat 
sequences, exons, known is-ncRNAs and pseudogenes 



(WS190). The remaining 6552 transfrags were filtered with 
EST data and WS190 data, leaving 5866 transfrags of 
unknown function (TUFs) not overlapping any annotated 
sequences. Recent analyses of mammalian tiling array
data (13) have suggested that short TUFs of low
signal intensity frequently represent cross-hybridization
and other noise. As the RNA sample used for hybridization
to the tiling arrays had been size fractioned and
depleted of the most abundant RNA species, the potential 
for cross-hybridization was greatly reduced compared to 
tiling arrays hybridized with polyadenylated RNA or total 
RNA. Furthermore, TUFs consisting of one or more 
probes matching more than one genomic position were 
removed (see 'Methods' section). Validation of 59 TUFs
randomly sampled in the low-signal intensity range 
(6 < max. L2SI < 8) by RT-PCR confirmed this by return- 
ing positive amplification for 85% of the selected TUFs 
(Supplementary Document S2). Northern blot analysis of 
10 TUFs all indicated a transcript in the expected size 
range (Figure 2), and subsequent RACE analysis showed 
that TUF and transcript size generally differed by 2-34% 
(Supplementary Document S2). The GC content was 
lower in TUFs than in known is-ncRNAs, and similar 
to that of coding exons (Figure 3A). There were no differ- 
ences between intronic and intergenic TUFs, but in both 
genomic contexts, the TUF GC content was higher than in 
the surrounding sequence. 
Genomic organization of the TUF loci

The chromosomal distribution of TUFs deviated from 
that of known is-ncRNA loci. Known is-ncRNA loci
(rRNAs and tRNAs not included) in C. elegans are almost
Figure 2. Northern blot analysis of 10 TUFs. All 10 TUFs are within the
expected 70-500 nt range.



evenly distributed on the autosomes but scarce on the X 
chromosome. TUFs, on the other hand, were slightly 
over-represented on the X chromosome (Figure 3B), 
irrespective of developmental stage or environmental con- 
dition under which they were expressed. Intergenic TUFs 
without nearby coding genes (distant TUFs, see below) 
showed the strongest tendency to locate on the X
chromosome (26%). There were 4025 TUFs (69%) with an intergenic
location, amounting to an overall density of about
1 TUF/10 kb of intergenic sequence (98.6 TUFs/Mb). As a
number of recent analyses (6,7,41,42) have observed
frequent non-coding transcription in the vicinity of 
active coding loci, we further divided the intergenic 
TUFs into two 'proximal' groups of 1895 and 858 TUFs 
located within 2 kb upstream or downstream, respectively, 
of a coding gene, and a group of 1272 distant TUFs 
located >2kb away from any gene. The density of 
proximal upstream (302.4 TUFs/Mb) and downstream 
(136.9 TUFs/Mb) was on average five times that of 
distant TUFs (45.0 TUFs/Mb). Closer analysis of 50 bp 
windows in the gene proximal sequences showed that TUF 
density peaked within the first hundred base pairs 
upstream and downstream of the WS190 annotated 
genes, reaching the highest value at ~150 bp upstream of



[Figure 3 image. Recoverable labels: legend entries 'known is-ncRNAs', 'intronic TUFs', 'distant TUFs', 'proximal TUFs', 'Mean intron size' and 'TUFs per intron'; axes 'Chromosome' (ChrI-ChrV, ChrX), 'Intron position' (1-8), 'Upstream (bp)' and 'Downstream (bp)' (0-2000 on either side of the coding gene).]

Figure 3. TUF genomic distribution. (A) GC content distribution for TUFs and other genomic regions. The GC contents of exons, introns and intergenic
regions were calculated using 10 000 randomly selected exons, introns and 100 bp intergenic fragments. (B) Chromosomal distribution of known is-ncRNA
and TUF loci. (C) Intron position, length and TUF density. Average intron size (420.56 nt) and TUF density (0.055 TUFs/intron) values were scaled to 1.
the 5' termini of annotated genes (Figure 3D). Together
with the 1841 intronic TUFs, 4594 TUFs (78.3%) are 
found within or in the vicinity of a protein coding 
sequence. 
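
The density figures in this section can be cross-checked against each other with a few lines of arithmetic. The counts and densities below are taken from the text; the amount of sequence in each class is not stated and is inferred here as count divided by density, so it should be read as an implied value rather than a reported one.

```python
# TUF counts and densities (TUFs/Mb) as reported in the text.
COUNTS = {"upstream": 1895, "downstream": 858, "distant": 1272}
DENSITY_PER_MB = {"upstream": 302.4, "downstream": 136.9, "distant": 45.0}
INTRONIC_TUFS = 1841
ALL_TUFS = 5866

# Implied sequence length (Mb) in each intergenic class: count / density.
implied_mb = {k: COUNTS[k] / DENSITY_PER_MB[k] for k in COUNTS}

intergenic_tufs = sum(COUNTS.values())
overall_density = intergenic_tufs / sum(implied_mb.values())

# Enrichment of proximal classes relative to distant intergenic sequence.
fold_upstream = DENSITY_PER_MB["upstream"] / DENSITY_PER_MB["distant"]
fold_downstream = DENSITY_PER_MB["downstream"] / DENSITY_PER_MB["distant"]
mean_fold = (fold_upstream + fold_downstream) / 2

# TUFs within or near (within 2 kb of) a protein coding sequence.
near_coding = COUNTS["upstream"] + COUNTS["downstream"] + INTRONIC_TUFS
near_fraction = near_coding / ALL_TUFS
```

Under these assumptions the three classes sum to the 4025 intergenic TUFs, the overall density comes out within rounding of 98.6 TUFs/Mb, the upstream and downstream enrichments (about 6.7-fold and 3.0-fold) average to roughly five, and 4594 of 5866 TUFs corresponds to 78.3%.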

The frequency of TUFs within a given intron varied 
considerably with intron position and the average 
number of TUFs in 5' UTR introns (intron 0) was 
~4 times higher than in later introns within the CDS 
region (Figure 3C). A longer intron 0 is a general
property of eukaryotic gene structure (43), and
introns located at the 5' proximal end of genes have been shown
to have important functional properties, often related to
gene expression (44,45). Using reference gene annotation 
data from UCSC, we calculated intron lengths for 23 665 
high-confidence genes. The average length of the intron 0 
(Figure 3C) was ~2.5 times longer than other introns 
(P < 0.01, Z-test); however, despite the longer size of
intron 0, its TUF density (1 TUF/11.5kb of intron 
sequence) was nearly twice that of other introns (1 TUF/ 
18.5 kb of sequence). 
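
The relation between intron length, TUF density and TUFs per intron can be made explicit with a small calculation. The inputs are the values quoted in this paragraph; the decomposition (TUFs per intron = length x density) is an illustrative reading of the text, not a computation reported by the authors.

```python
# Values quoted in the text.
LENGTH_RATIO = 2.5            # intron 0 length relative to other introns
DENSITY_INTRON0 = 1 / 11.5    # TUFs per kb of intron 0 sequence
DENSITY_OTHER = 1 / 18.5      # TUFs per kb of other intron sequence

# Density of intron 0 relative to other introns.
density_ratio = DENSITY_INTRON0 / DENSITY_OTHER

# Expected ratio of TUFs per intron if count scales as length x density.
tufs_per_intron_ratio = LENGTH_RATIO * density_ratio
```

The density ratio works out to about 1.6 (the 'nearly twice' of the text), and combining it with the 2.5-fold length difference gives about four times as many TUFs per intron 0, in line with the ~4-fold figure given at the start of the paragraph.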

Most TUFs show stage-dependent expression 

There were no particular differences in average expression 
levels of TUFs located in the four different genomic 
regions (Supplementary Figure S3A). However, unlike 
most of the previously known is-ncRNAs, which show 
relatively uniform expression through worm development, 
TUF expression commonly fluctuated considerably across 
stages (Supplementary Figure S3B). There were marked 
differences in the number of TUFs that were expressed 
above cutoff (L2SI = 6) in each developmental stage or 
condition (Figure 4A), and 3975 (67.8%) of the TUFs 
were expressed in only one stage or condition. Only 70 
TUFs (1.2%) were ubiquitously expressed in all stages 
and conditions; these, however, were much more
strongly expressed (mean L2SI = 11.4) than TUFs ex- 
pressed in a single stage or condition (mean L2SI = 6.4). 
The number of TUFs expressed at the first larval stage
was markedly lower than in the older larval stages and in
MAs (Figure 4A), and the number expressed in ML



worms was ~20% lower than in the general (mostly herm- 
aphrodite) MA populations. The number of expressed 
TUFs was lowest in the DU stage, which probably 
reflects the generally reduced physiological activity at 
this stage. HS worms had a high number of expressed 
TUFs, and also displayed the highest number of TUFs 
that were specifically expressed at any stage or condition 
(427; Figure 4B). However, a high number of specifically 
expressed TUFs was also seen in the ML stage (229; 
Figure 4B), despite the relatively low total number of
expressed TUFs at this stage, suggesting that a
disproportionately high number of small transcripts may be required
for attaining or maintaining this specific stage. Also, early
worm development (L1) was associated with a relatively
high number (65) of specifically expressed TUFs, despite 
the overall low number of TUFs expressed. A small 
number of TUFs appeared to be present at all but one 
specific stage, such as the DU (6) and ML (7) stages and 
the HS (5). 
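
The stage-specificity categories used above can be expressed as a simple classification over per-stage signal values. The sketch below is illustrative only: the function name and toy profiles are hypothetical, the cutoff of L2SI = 6 is applied as a strict comparison by assumption, and grouping of stages (as in 'one group of stages') is not modelled.

```python
STAGES = ["L1", "L2", "L3", "L4", "MA", "ML", "DU", "HS"]

def classify_tuf(profile, cutoff=6.0):
    """Classify a TUF from its per-stage L2SI values.
    Returns (category, stage), where stage is the single expressed stage for
    'enriched', the single missing stage for 'reduced', and None otherwise."""
    expressed = [s for s in STAGES if profile[s] > cutoff]
    if len(expressed) == len(STAGES):
        return ("ubiquitous", None)
    if len(expressed) == 1:
        return ("enriched", expressed[0])
    if len(expressed) == len(STAGES) - 1:
        missing = next(s for s in STAGES if s not in expressed)
        return ("reduced", missing)
    if not expressed:
        return ("not expressed", None)
    return ("mixed", None)

# Toy profiles (hypothetical values, not data from the study).
male_only = dict.fromkeys(STAGES, 5.0)
male_only["ML"] = 7.3
everywhere = dict.fromkeys(STAGES, 11.4)
all_but_dauer = dict.fromkeys(STAGES, 8.0)
all_but_dauer["DU"] = 4.1

# Shares reported in the text, recomputed from the counts.
single_stage_pct = round(100 * 3975 / 5866, 1)
ubiquitous_pct = round(100 * 70 / 5866, 1)
```

Applied to a full signal matrix, counting the 'enriched' calls per stage would give numbers comparable to those quoted for HS, ML and L1, subject to the assumptions above.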

TUFs show dual conservation distributions 

In order to estimate the conservation levels of intergenic 
and intronic TUFs, PhastCons scores from six nematode 
genomes were downloaded from UCSC (27). To compare 
the conservation distribution of TUFs to annotated tran- 
scriptional units, we calculated phastCons scores for all 
known is-ncRNAs, and for randomly selected exonic, 
intronic and intergenic fragments (10 000 fragments of
100 bp each). Compared with known is-ncRNAs,
which are generally well conserved among the nematodes,
both intergenic and intronic TUFs displayed a dual
distribution, in which ~40% of the TUFs are almost completely
non-conserved (average phastCons score <0.2), and the
majority of the remaining TUFs have phastCons scores
in the range 0.4-0.8 (Figure 5A and B). Both intronic and 
intergenic TUFs have distributions that differ in shape 
from those of their respective genomic environments 
(i.e. randomly selected introns and intergenic sequences), 
but only intronic TUFs had significantly (P < 0.01,
Wilcoxon's test) higher phastCons scores than introns in 
of TUFs expressed in only one (or one group of) stage(s) ('enriched') or expressed in all (or nearly all) stages but one ('reduced'). The data
corresponding to the shaded area were not considered.






general. Only 20% of all TUFs had phastCons scores at 
the same level as most known is-ncRNAs (average 
phastCons score > 0.7; Figure 5B); however, these 
approximately 1100 TUFs represent more than twice
the present number of intermediate-size transcript loci in
C. elegans with phastCons scores at this level. An
additional peculiarity was that TUFs located on the X
chromosome had significantly (P < 2.2 x 10^-16,
Wilcoxon's test) higher phastCons scores than autosomal
TUFs (Figure 5C). TUFs on chromosomes I and IV had
somewhat lower phastCons scores (P < 0.01 and P < 0.05,
respectively) than TUFs on the remaining autosomes,
but the differences were far less pronounced than that
between the X chromosome and the autosomes. Apart
from being conspicuously absent immediately downstream
of coding loci, TUFs with high phastCons scores
showed no particular distribution on the X chromosome 
(Figure 5D). 
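
The conservation classes referred to in this section can be illustrated by averaging per-base phastCons scores over a TUF and binning the result. The two cut points (<0.2 and >0.7) are the ones quoted in the text; the function names, the label for the middle class and the toy scores are assumptions made for the example.

```python
def mean_phastcons(scores):
    """Average per-base phastCons score over one TUF."""
    return sum(scores) / len(scores)

def conservation_class(avg):
    """Bin an average phastCons score using the cut points quoted in the text."""
    if avg < 0.2:
        return "non-conserved"
    if avg > 0.7:
        return "highly conserved"
    return "intermediate"

# Toy per-base scores (hypothetical, not data from the study).
toy_scores = {
    "TUF_a": [0.0, 0.1, 0.05, 0.0],
    "TUF_b": [0.9, 0.8, 0.95, 0.85],
    "TUF_c": [0.5, 0.6, 0.4, 0.7],
}
toy_classes = {
    name: conservation_class(mean_phastcons(scores))
    for name, scores in toy_scores.items()
}
```

Comparing the resulting averages between chromosomes (for example X versus autosomes) would then require a rank-based test such as the Wilcoxon test named in the text, which is not reproduced here.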

Further analysis of the relationship between TUFs 
and sequence conservation revealed several interesting 
correlations. There were 1895 TUFs that were located
within 2 kb upstream of 1714 protein coding loci, and
phastCons scores of coding loci with at least one TUF 
located upstream of their annotated 5'-termini had on 



average significantly (P < 0.01, Z-test) higher phastCons
scores than other coding genes (Figure 5E). There was 
also a tendency that immediate up- and down-stream 
flanking regions (<500bp) of coding genes displayed 
higher phastCons scores if a TUF was located within the 
flanking region (Figure 5E). To further analyze poten- 
tial interactions between TUFs and neighboring coding 
genes, coding gene expression profile data from the 
C. elegans Gene Expression Consortium (see 'Methods' 
section) were compared with expression data from the 
larval stages (L1-L4). We obtained expression profiles 
for 1872 coding genes with one or more flanking
or intronic TUFs, and established 2623 TUF-gene pairs,
consisting of one TUF and its nearest gene; however,
94.5% of TUFs showed little or no expressional
correlation with their neighboring genes. A recent analysis
has found mouse neuronal enhancers to be
commonly transcribed (12). However, a re-analysis of
C. elegans DNase I hypersensitive sites (46) failed to
show any substantial co-location of TUFs and potential
cis-regulatory elements (Supplementary Figure S4), and
transcriptional activation of enhancers may thus not be
a prominent feature of the cis-regulatory mechanism in
the worm.
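
A comparison of TUF and neighboring gene expression across the four larval stages could be carried out along the following lines. This is a sketch under explicit assumptions: the text does not name the correlation measure or the cutoff used, so Pearson correlation and a threshold of |r| >= 0.9 are chosen here purely for illustration, and the profiles are invented.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def is_correlated(tuf_profile, gene_profile, min_abs_r=0.9):
    """Flag a TUF-gene pair as correlated (threshold is a hypothetical choice)."""
    return abs(pearson(tuf_profile, gene_profile)) >= min_abs_r

# Toy L1-L4 profiles (hypothetical values, not data from the study).
toy_pairs = {
    "pair_1": ([6.1, 6.8, 7.5, 8.2], [8.0, 8.9, 9.6, 10.5]),  # rise together
    "pair_2": ([6.5, 6.5, 6.5, 6.5], [8.0, 9.0, 7.0, 8.5]),   # flat TUF
}
toy_flags = {name: is_correlated(t, g) for name, (t, g) in toy_pairs.items()}
```

With only four time points a correlation coefficient is a coarse measure, which is one reason a stringent threshold is assumed in this sketch.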
Figure 5. Conservation analyses. (A) Distribution of TUF phastCons scores. (B) PhastCons score distributions of known is-ncRNAs, TUFs and
100 bp randomly selected intronic, exonic and intergenic sequences. Intronic TUFs have significantly higher phastCons scores than that of introns
Motifs, repeats and conserved structure

A search for conserved structures using the Infernal 
software (30) against the Rfam database (31) yielded 
33 hits. Among these, four were mammalian miRNAs,
26 were sequences that the software assigned as 'HIV- 
related signal', two were prokaryotic small RNA or 
mRNA leader sequences, and one was C. elegans 
snoRNA U6-47, which for some reason has not yet been 
included into Wormbase. Secondary structure analysis 
predicted 42 and 23 novel C/D box and H/ACA box 
snoRNAs, respectively (Supplementary Document S1).

Previous analyses of is-ncRNA loci in C. elegans have 
revealed the presence of common upstream motifs with 
putative or verified promoter activity (4,47). A search 
for known upstream sequence motifs in the 200 bp sequence
flanking each TUF was performed with the MAST software
(38). The flanking (and most likely upstream) sequence
of 15 TUFs displayed one of the three previously
reported C. elegans is-ncRNA promoter motifs (UM1-UM3,
E < 0.01) (4,5). Eleven of these were UM1, corresponding
to the snRNA proximal sequence element (PSE)
previously identified in C. elegans (48). The MEME 
software was also applied to search both strands of the 
TUFs for internal sequence motifs, yielding one novel 
internal motif (IM4), which was shared by 8 TUFs and 
formed part of a predicted stem-loop structure (Figure 6). 



About one-third of the TUFs (1865) were flanked 
within 200 bp by repeats from the UCSC RepeatMasker 
annotation (2436 repeats in total). Approximately half 
(1231) were simple repeats, the majority being AT-rich 
(758), and the other half complex repeats of which 
CELE14B was the most frequent type. The distance 
distribution of the repeats showed a conspicuous peak at 
10-15 bp away from the TUF (Supplementary Figure S5), 
which might result from the removal of repetitive regions in
the design of the chip. A certain fraction of C. elegans
repeats might act as promoters initiating the transcription
of downstream sequences, as in other organisms
(49). The CELE14 MITE repeat family occurs 3020 times 
in the C. elegans genome, mostly clustered near the ends of 
the autosomes (50) and the CELE14B repeat was observed 
flanking 80 TUFs. In a few cases, CELE14B was flanking 
a TUF on both sides, leaving little room for pro- 
moter sequences to be located elsewhere than within 
the CELE14B sequences themselves (Supplementary 
Figure S6). 



DISCUSSION 

Using tiling array technology, we have profiled the 
intermediate-size (70-500 nt) transcriptome of C. elegans 
through eight developmental and conditional stages. After 
stringent filtering, the analysis altogether included 32 230 
transfrags, of which 5866 were located in non-annotated 
intergenic or intronic regions. Most of these potential 
RNAs exhibited distinct features common to known 
classes of is-ncRNAs, and a fraction of them shared 
novel upstream or internal motifs. The data give an elab- 
orate view of the chromosomal distribution, conservation 
and expression profiles of this segment of the C. elegans 
transcriptome, and the number and expressional com- 
plexity of the novel transcripts suggest roles in nematode 
development and phenotypic specification (51). We have 
previously estimated that there may be 3000-4000 
is-ncRNA loci in C. elegans (4,25), and a previous 
analysis of mixed-stage worms suggested even higher 
numbers (5). Taking into account the apparent 
developmental and conditional specificity of TUF 
expression, analysis of additional developmental stages 
(e.g. egg, embryo and aging) and conditions might 
increase the number to a level of 7000-10000 
intermediate-size transcripts. 

The emergence of second-generation sequencing 
(RNA-Seq) data has cast doubt on the quality and 
correctness of mammalian tiling array data (13). Tiling 
arrays generally have higher false-positive rates than 
RNA-Seq, and non-coding transcription appears to be 
concentrated within and around coding loci rather than 
pervasive throughout the entire genome (13). In contrast 
to the human tiling microarrays analyzed by van Bakel 
et al. (13), the C. elegans tiling array is designed with 
both perfect-match and mismatch probes, which facilitates 
stringent data normalization and filtering. An analysis of 
C. elegans tiling array and RNA-Seq data found good 
general agreement in expression levels at identical loci in 
the two data sets, and 86% of the loci identified by the tiling 
array as differentially expressed between two developmen- 
tal stages were confirmed by the RNA-Seq data (26). If 
anything, the tiling array data underestimated the number 
of differentially expressed loci (26), and the variation 
in TUF expression levels across nematode development 
may actually be even more pronounced than reported 
here. The risk of high false-positive rates owing to 
cross-hybridization could mainly be limited to a 'black 
list' of 2327 regions (26), of which only one remained 
after our own (independent) filtering of the data. Our 
validation rates (~85%) were higher than those reported 
for human tiling array data (25-70%) (6,52), despite the 
fact that the TUFs used for validation were selected among 
those with lowest expression levels (6 < L2S1 < 8). 

We have assumed that the majority of the novel 
intermediate-size transcripts are not translated into 
peptides. A recent study in Drosophila has shown that 
some transcripts previously considered to be non-coding 
RNAs contain short open reading frames encoding 11-32 
amino acid long bioactive peptides involved in temporal 
regulation of epidermal morphogenesis (53). Although a 
number of strategies have been developed to distinguish 
protein-coding RNAs from ncRNAs, the distinction between 
the protein-coding and non-coding categories is not 
entirely clear (1), and the existence of bifunctional 
RNAs further contributes to this confusion (54,55). 
However, short (<100 codons) active ORFs tend to reside 
in transcripts that are considerably longer than those 
identified in this study (53,56). 

The higher density of TUFs dwelling within or close to 
coding genes may reflect some aspect of their biogenesis or 
function. One distinct feature in C. elegans is that 
nearly half of the TUFs located upstream of genes were 
detected upstream of trans-splice acceptor sites or 
within operons, suggesting the possibility that some TUFs 
might represent outrons (57) resulting from 
trans-splicing. In yeast, CUTs are frequently associated 
with promoters of coding genes (41). In Arabidopsis, UNTs 
are collinear with the 5' ends of known mRNAs and 
frequently extend into the first intron of the 
respective overlapping genes. The possibility that UNTs 
derive from the pre-mRNA is highly improbable, as some 
UNTs are more abundant than the corresponding gene. 
Moreover, mapping of the human transcriptome has also 
revealed an abundance of PASRs, possibly produced 
by pervasive or bidirectional transcription of promoter 
regions depleted of nucleosomes (6,10,14). The TASRs 
may be generated by a similar mechanism (58). However, 
one caveat to such an interpretation of the data is 
provided by a previous analysis, which has shown that 
intronic is-ncRNA loci in C. elegans commonly have 
independent promoter activity and show little or no 
expressional correlation to the protein-coding loci 
within which they reside (59). 

The peculiar fact that the presence of a proximal TUF 
correlated with the conservation level of the nearby 
coding gene is suggestive of some sort of functional 
relationship. Previous studies in yeast have found that 
the ncRNA SRG1 can interfere with the promoter of the 
downstream stress-responsive gene SER3 by blocking the 
binding of transcription factors (9). On the other hand, 
as the expression levels of most TUFs were not correlated 
with those of 
their respective proximal coding genes, and previous 
analyses have demonstrated a high degree of transcrip- 
tional independence even for C. elegans intronic 
is-ncRNAs (4,59), most TUF loci may well be transcrip- 
tionally and functionally independent of neighboring 
coding loci. 

The novel TUFs differ from known is-ncRNAs in 
several aspects. Most of the previously known 
is-ncRNAs are well conserved across a wide range of or- 
ganisms, whereas the novel TUFs show a dual conserva- 
tion distribution, with approximately one-half of the 
TUFs being conserved within nematodes, and the rest 
apparently being specific to C. elegans. The failure to 
identify flanking sequence motifs suggests that most 
TUF loci do not have recognizable promoter and termin- 
ator sequences. This may result from most TUFs being 
by-products of the nearby transcriptional processes; 
however, a number of previously verified, but functionally 
uncharacterized is-ncRNA loci show a similar lack of 
canonical structures (4). Moreover, known is-ncRNAs 
such as snRNAs and snoRNAs are generally pervasively 
expressed through most developmental stages, whereas 
most TUFs showed fluctuating expression across stages. 
This tendency of novel ncRNAs to display stage-specific 
expression was also observed in recently published 
studies of the C. elegans transcriptome (60,61), 
corroborating the data in this study. The observation 
that a small number of TUFs appear to be present at all 
but one specific stage is curious. Though absence of 
evidence is not evidence of 
absence, it is tempting to speculate that there may exist a 
small number of transcripts whose presence is required for 
the 'default' state of the worm (i.e. rapid development 
toward the mature hermaphrodite), or alternatively, 
whose removal is required for the entrance into the ML 
or DU stages, respectively. Also, the average GC content 
of the TUFs was lower than that of known is-ncRNAs, 
and was closer to that of protein-coding exons. Thus, the 
previously known is-ncRNAs may, with some notable ex- 
ceptions, be an assembly of highly and stably expressed 
loci that are not very representative of the expressional 
and functional complexity of the intermediate-size 
transcriptome in C. elegans. 

A substantial fraction of the TUFs showed no or nearly 
no conservation in other nematode species. However, lack 
of conservation does not necessarily imply lack of 
function, inasmuch as some of the genomic information 
(coding and non-coding) will be required to specify 
C. elegans as a distinguishable nematode species. With 
the exception of chromosomal location, the TUF 
phastCons score distribution was not related to any 
other TUF characteristic; thus, TUFs with high and low 
phastCons scores were indistinguishable with respect 
to genomic localization, stage of expression and signal 
intensity distribution (Supplementary Figure S7). 
Consequently, the data suggest that there are substantial 
numbers of transcriptionally active genetic elements in 
C. elegans that are not conserved in other nematodes. 
These unconstrained TUFs may belong to a large pool 
of neutral elements that are biologically active but 
non-orthologous between nematodes (62), and it is 
possible that the non-conserved TUFs may play import- 
ant roles in distinguishing C. elegans from other nema- 
todes (41). 

The functional roles of this large complement of novel 
loci can at present only be speculated upon. Many TUFs 
showed male-specific expression. Combined with the 
finding that the X chromosome was enriched for TUF 
loci, this strongly suggests involvement of these loci in 
sex determination or gender-specific functions. The lack 
of conservation outside the nematodes, and for a large 
fraction of the TUF loci, even within the sequenced 
nematode genomes, might suggest functional roles in 
specifying the nematode lineage, or even roles in 
distinguishing C. elegans from other nematodes. High 
numbers of non-conserved transcripts have also been 
identified in several other organisms. In yeast, >80% of 
the 185 novel CUTs were produced from genomic regions 
with low conservation scores even among closely related 
yeast species (41), and in humans, most of the detected 
unannotated 
transcribed sequences appear not to be strongly conserved 
in the mouse genome (63,64). On the other hand, since 
RNA function commonly depends more on secondary 
structure than primary sequence, it cannot be excluded 
that conserved (i.e. identical or similar) functions are 
executed by RNA loci that are no longer alignable. A 
significant number of human genomic regions not 
alignable to the mouse genome were found to have 
signatures of RNA structure and were twice as likely to 
overlap tiling-array-detected transfrags (18). A search 
for rapidly evolving sequences in the human lineage 
identified a 
non-coding RNA with brain-specific expression (65), and 
a possible interpretation of the non-conserved TUFs is 
that they derive from genomic regions that exhibit recent 
evolutionary change. Mouse eRNAs have been 
hypothesized as an evolutionary source for new genes 
(66), and a fraction of the TUFs that are specific to 
C. elegans may well provide a warehouse of neutral 
elements available for further evolution. 